Performance Analysis of All-to-All Communication on the Blue Gene/L Supercomputer
Abstract
All-to-all communication is a well-known performance bottleneck for many applications. For such applications to scale to a large number of processors, optimizing all-to-all communication is critical. In this paper, we analyze the performance of all-to-all communication on the Blue Gene/L torus interconnection network, which has limited bisection bandwidth. The torus interconnect topology suffers link contention even for all-to-all operations with short messages. We observed that the performance of all-to-all communication also depends on the shape of the processor partition. We present a performance analysis of all-to-all communication on mesh and torus partitions of various shapes and sizes. We then present optimization schemes to enhance the performance of all-to-all communication. The large-message optimization substantially improves all-to-all performance on an asymmetric torus. In particular, performance improved from about 70% to over 99% of peak on a 20,480-node (40 × 32 × 16) configuration, which was the largest machine to which we had access. The short-message optimization can double all-to-all performance for very short messages.
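The dependence on partition shape can be illustrated with a standard bisection argument. The sketch below is not taken from the paper: it assumes uniform traffic, an even longest dimension, and a caller-supplied per-direction link bandwidth, and the function name `alltoall_lower_bound` is hypothetical. Cutting the partition in half across its longest dimension severs one plane of links on a mesh and two on a torus (because of the wraparound links), and all traffic between the two halves must cross those links.

```python
from math import prod


def alltoall_lower_bound(dims, msg_bytes, link_bw, torus=True):
    """Bisection-limited lower bound (in seconds) on all-to-all time.

    dims      -- partition shape, e.g. (40, 32, 16)
    msg_bytes -- bytes each node sends to every other node
    link_bw   -- bandwidth of one link in one direction, bytes/second
                 (illustrative parameter, not a value from the paper)
    torus     -- True for a torus (wraparound links), False for a mesh
    """
    nodes = prod(dims)
    longest = max(dims)  # assumed even, so the two halves are equal
    # Halving the longest dimension cuts 1 plane of links on a mesh
    # and 2 planes on a torus; each plane holds nodes/longest links.
    cut_planes = 2 if torus else 1
    links_one_way = cut_planes * (nodes // longest)
    # Every node in one half sends msg_bytes to every node in the other.
    bytes_one_way = (nodes / 2) ** 2 * msg_bytes
    return bytes_one_way / (links_one_way * link_bw)


if __name__ == "__main__":
    # Normalized units: 1-byte messages over links of 1 byte/second.
    for shape in [(40, 32, 16), (32, 32, 16), (8, 8, 8)]:
        print(shape, alltoall_lower_bound(shape, 1, 1.0))
```

Under these assumptions the bound simplifies to P · L · m / (8 · B) on a torus and P · L · m / (4 · B) on a mesh, where P is the node count, L the longest dimension, m the message size, and B the link bandwidth. For a fixed node count, an elongated partition therefore has a lower achievable peak than a more cubic one.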
Similar resources
Implementing Optimized Collective Communication Routines on the IBM BlueGene/L Supercomputer
BlueGene/L is a massively parallel supercomputer that is currently the fastest in the world. Implementing MPI, and especially fast collective communication operations, can be challenging on such an architecture. In this paper, I will present optimized implementations of MPI collective algorithms on the BlueGene/L supercomputer and show performance results compared to the default MPICH2 algorithm...
Performance Measurements of the 3D FFT on the Blue Gene/L Supercomputer
This paper presents performance characteristics of a communications-intensive kernel, the complex data 3D FFT, running on the Blue Gene/L architecture. Two implementations of the volumetric FFT algorithm were characterized, one built on the MPI library using an optimized collective all-to-all operation [2] and another built on a low-level System Programming Interface (SPI) of the Blue Gene/L Adv...
Versatile Communication Algorithms for Data Analysis
Large-scale parallel data analysis, where global information from a variety of problem domains is resolved in a distributed memory space, relies on communication. Three communication algorithms motivated by data analysis workloads—merge based reduction, swap based reduction, and neighborhood exchange—are presented, and their performance is benchmarked. These algorithms communicate custom data t...
Model and simulation of exascale communication networks
Exascale supercomputers will have millions or even hundreds of millions of processing cores and the potential for nearly billion-way parallelism. Exascale compute and data storage architectures will be critically dependent on the interconnection network. The most popular interconnection network for current and future supercomputer systems is the torus (e.g., k-ary, n-cube). This paper focuses o...
Toward the Graphics Turing Scale on a Blue Gene Supercomputer
We investigate raytracing performance that can be achieved on a class of Blue Gene supercomputers. We measure an 822-times speedup over a Pentium IV on a 6144-processor Blue Gene/L. We measure the computational performance as a function of number of processors and problem size to determine the scaling performance of the raytracing calculation on the Blue Gene. We find nontrivial scaling behavior...
Publication date: 2007